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Abstract 

We present new results building on the conservative deterministic spec- 
tral method for the space inhomogeneous Boltzmann equation developed 
by Gamba and Tharkabhushaman. This approach is a two-step process 
that acts on the weak form of the Boltzmann equation, and uses the ma- 
chinery of the Fourier transform to reformulate the collisional integral 
into a weighted convolution in Fourier space. A constrained optimization 
problem is solved to preserve the mass, momentum, and energy of the re- 
sulting distribution. We extend this method to second order accuracy in 
space and time, and explore how to leverage the structure of the collisional 
formulation for high performance computing environments. The locality 
in space of the collisional term provides a straightforward memory decom- 
position, and we perform some initial scaling tests on high performance 
computing resources. We also use the improved computational power of 
this method to investigate a boundary-layer generated shock problem that 
cannot be described by classical hydrodynamics. 

1 Introduction 

There are many difficulties associated with numerically solving the Boltzmann 
equation, most notably the dimensionality of the problem and the conservation 
of the collision invariants. For physically relevant three dimensional applica- 
tions the distribution function is seven dimensional and the velocity domain is 
unbounded. In addition, the collision operator is nonlinear and requires evalu- 
ation of a five dimensional integral at each point in phase space. The collision 
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operator also locally conserves mass, momentum, and energy, and any approx- 
imation must maintains this property to ensure that macroscopic quantities 
evolve correctly. 

Spectral methods are a deterministic approach that compute the collision 
operator to high accuracy by exploiting its Fourier structure. These methods 
grew from the analytical works of Bobylev [3] developed for the Boltzmann 
equation for Maxwell type potential interactions and integrable angular cross 
section, where the corresponding Fourier transformed equation takes a closed 
form. Spectral approximations for this type of models where first proposed by 
Pareschi and Perthamc [11]. Later Pareschi and Russo [12] applied this work 
to variable hard potentials by periodizing the problem and its solution and 
implementing spectral collocation methods. 

These methods require 0(N 2d ) operations per evaluation of the collision 
operator, where N is the total number of velocity grid points in each dimension. 
While convolutions can generally be computed in 0(N d log N) operations, the 
presence of the convolution weights requires the full 0{N 2d ) computation of 
the convolution , except for a few special cases, e.g., the Fokker-Planck-Landau 
collision operator [13, 9]. Spectral methods provide many advantages over Direct 
Simulation Monte Carlo Methods (DSMC) because they are more suited to time 
dependent problems, low Mach number flows, high mean velocity flows, and 
flows that are away from equilibrium. In addition, deterministic methods avoid 
the statistical fluctuations that are typical of particle based methods. 

Inspired by the work of Ibragimov and Rjasanow [8], Gamba and Tharkab- 
hushanam [5, 6] observed that the Fourier transformed collision operator takes 
a simple form of a weighted convolution and developed a spectral method based 
on the weak form of the Boltzmann equation that provides a general framework 
for computing both elastic and inelastic collisions. Macroscopic conservation 
is enforced by solving a numerical constrained optimization problem that finds 
the closest distribution function in L2 to the output of the collision term that 
conserves the macroscopic quantities. These methods do not rely on periodiza- 
tion but rather on the use of the FFT tool in the computational domain, where 
convergence to the solution of the continuous problem is obtained by the use of 
the extension operator in Sobolev spaces. 

This paper presents extensions to this method and its implementation on 
high performance computing resources. Without loss of generalization, we re- 
strict this presentation to elastic collisions. We present a second order in time 
and space extension of the Gamba and Tharkabhushanam method allowing for 
nonuniform grids. This method has been implemented on the Lonestar super- 
computer at the Texas Advanced Computing Center (TACC) and timing studies 
are provided to explore the scaling of the method to more and more processors 
for large problems. Finally we present some ID in physical space results for a 
problem proposed by Aoki et al. [1] where a sudden change in wall temperature 
results in a shock that cannot be explained by classical hydrodynamics. 
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2 The space inhomogeneous Boltzmann equa- 
tion 



The space inhomogeneous initial-boundary value problem for the Boltzmann 
equation is given by 

^/(x,v,i)+v V x /(x,v,i) = - £ Q{f,f), (1) 

with 

xeSlcR'', v e R d 

/(x,v,0) = / (x,v) 

/(x, v,i) = / B (x,v,i), x e 90. 

where f(v,t) is a probability density distribution in v-space and /o is assumed 
to be locally integrable with respect to v and the spatial boundary condition 
/s will be specified in below. The dimensionless parameter e > is the scaled 
Knudsen number, which is defined as the ratio between the mean free path 
between collisions and a reference macroscopic length scale. 

The collision operator Q(f, f)(x,v,t) is a bilinear integral form local in t 
and x and is given by 

Q(/,/)(-,v,-)= f f B(\v-v4,co S 6)(f(v:)f(v')-f(v*)f(v))dadv*, 

(2) 

where the velocities v', v'„ are determined through a collision rule (3), depending 
on v, v*„ and the positive term of the integral in (2) evaluates / in the pre- 
collisional velocities that will take the direction v after an interaction. The 
collision kernel B(\v — v*|,cos#) is a given non- negative function depending 
on the size of the relative velocity u := v — v* and cos 8 = where a in 
the n — 1 dimensional sphere S n ~ 1 is referred as the scattering direction of the 
post-collisional elastic relative velocity. 

For the following we will use the velocity elastic (or reversible) interaction 
law in center of mass-relative velocity coordinates 

v' = v+i(|u|a-u), v:=v*-i(|u|a-u) (3) 

B(|u|,cos0) = |u| A fe(cos0). 

We assume that the differential cross section function &(cos 9) is integrable with 
respect to a on referred to as the Grad cut-off assumption, and that it is 

renormalized such that 

/ b(cos6)da = 1. (4) 
Js d - 1 

The parameter A regulates the collision frequency as a function of the relative 
speed |u|. This corresponds to the interparticle potentials used in the derivation 
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of the collisional kernel and are referred to as variable hard potentials (VHP) for 
< A < 1, hard spheres (HS) for A = 1, Maxwell molecules (MM) for A = 0, and 
variable soft potentials (VSP) for — 3 < A < 0. The A = — 3 case corresponds to 
a Coulombic interaction potential between particles. If &(cos 9) is independent 
of a we call the interactions isotropic (like the case of hard spheres in three 
dimensions) . 

2.1 Boundary conditions 

On the spatial boundary Oft we use a diffusive Maxwell boundary condition 
which is given by, for x G dfl, 

«^'') = »S"(-^)' (v-V„,.„>0 (5) 
~5?fr) / (v- V w )-n/(x,v,i)dv, 

nl w J J(v-V„)-n<0 

where V m and T w are the wall velocity and temperature, respectively, and n is 
the unit normal vector to the boundary, directed into fl. The term a w accounts 
for the amount of particles leaving the domain and ensures mass conservation 
in 0. 



2.2 Spectral formulation 

The key step our formulation of the spectral numerical method is the use of 
the weak form of the Boltzmann collision operator. For a suitably smooth test 
function <f>(y) the weak form of the collision integral is given by 

/ 0(/,/)^(v)dv= / /(v)/(v,)S(|u|,cose)(0(v , )-0(v))dtrdv.dv 

JR d JR d xR d xS d - 1 

(6) 

If one chooses 

0(v) = e-^ v /(\/2^) d , 
then (6) is the Fourier transform of the collision integral with Fourier variable 



$(c) = (7^I w ' /)e ""' Vdv 

= / /(v)/W*^(e^-e-^.dv (7) 

JR d xR d xs d - 1 (V2ir) a 

= [ G(u,Cm/(v)/(v-u)](C)du, (8) 

JR d 

where [•] = J"(-) denotes the Fourier transform and 



G(u,C) = |u| A J b(cos8) (e-i§C-\n\a e i§C-u _ ^ da 



(9) 
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Further simplification can be made by writing the Fourier transform inside the 
integral as a convolution of Fourier transforms: 

q(0= / G(£,o/(c-o/(Ode, (io) 

where the convolution weights G(£, () are given by 

G(£, C) = — 1=- / G(u, C)e-*»du (11) 

These convolution weights can be precomputed once to high accuracy and stored 
for future use. For many collision types the complexity of the integrals in the 
weight functions can be reduced dramatically through analytical techniques. In 
this paper we will only consider isotropic scattering in dimension 3 (6(cos#) = 
1/4tt). In this case we have that 



G(£, C) = j^=y 3 r A+2 (sine sinc(r|£ - ^\) sinc(r|£|)) dr. 

(12) 

This integral will be cut off at a point r = r , which will be determined 
below. Given this cutoff point, we can explicitly compute G for integer values 
of A. For other values of A, this is simply a one-dimensional integral that can be 
precomputed to high accuracy using numerical quadrature. The entirety of the 
collisional model being used is encoded in the weights, which gives the algorithm 
a large degree of flexibility in implementing different models. 



3 The Conservative Numerical Method 

3.1 Temporal and velocity space discretization 

We use an operator splitting method to separate the mechanisms of collisions 
and advection. The system is split into the subproblcms 

^/ + «-V x / = (13) 

§- t f = Q(f,f), (14) 

which are solved separately. 

Each system is evolved in time using a second-order Runge-Kutta method, 
and the systems are combined using Strang splitting. 

We assume that the distribution function is negligible outside of a ball 

B«.(V(x)) = {veR d :|v- V(x)| < R x }, (15) 
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where V(x) is the local flow velocity which depends in the spatial variable x. 
For ease of notation in the following we will work with a ball centered at and 
choose a length R large enough that Br x (V(x)) C -Br(O) for all x. 

With this assumed support for the distribution /, the integrals in (10) will 
only be nonzero for u € B2r(0). Therefore, we set L = 2R and define the cube 

C L = {v€M. d :\ Vj \<L, j = l,...,d} (16) 

to be the domain of computation. For such domain, the computation of the 
weight function integral (12) is cut off at ro = L. 

Let N e N be the number of points in velocity space in each dimension. 
Then the uniform velocity mesh size is Aw = ^ anc l d ue to the formulation 
of the discrete Fourier transform the corresponding Fourier space mesh size is 
given by AC = J. 

The mesh points are defined by 

v k = Av(k - N/2) 

C k = AC(k - N/2) (17) 
k= (ki,...,k d ) € Z d , < kj < N- 1, j = l,...,d (18) 

3.2 Collision step discretization 

Returning to the spectral formulation (10), the weighted convolution integral 
then becomes an integral over — < £j < j = l,...,d. 

To simplify notation we will use one index to denote multidimensional sums 
with respect to an index vector m 

N-l N-l 

E = E • 

m— m 1 ,....m d — 

To compute Q(Ck), we first compute the Fourier transform integral giving 
/(Cfc) via the FFT. The weighted convolution integral is approximated using the 
trapezoidal rule 

N-l 

QKk) = E d ^™> Ck)/(U)/(Ck - U)Wm, (19) 

m=0 

where w m is the quadrature weight and we set /(Ck — ^m) = if (Ck — Cm) is 
outside of the domain of integration. We then use the inverse FFT on Q to 
calculate the integral returning the result to velocity space. 

Note that in this formulation the distribution function is not periodized, as 
is done in the collocation approach of Pareschi and Russo [12]. This is reflected 
in the omission of Fourier terms outside of the Fourier domain. All integrals 
are computed directly only using the FFT as a tool for faster computation. The 
convolution integral is accurate to at least the order of the quadrature. The 
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calculations below use the trapezoid rule, but in principle Simpson's rule or 
some other uniform grid quadrature can be used. However, it is known that the 
trapezoid rule is spectrally accurate for periodic functions on periodic domains 
(which is the basis of spectral accuracy for the FFT), and the same arguments 
can apply to functions with sufficient decay at the integration boundaries [2]. 
These accuracy considerations will be investigated in future work. The overall 
cost of this step is 0{N 2d ). 

3.3 Discrete conservation enforcement 

This implementation of the collision mechanism does not conserve all of the 
quantities of the collision operator. To correct this fact, we formulate these 
conservation properties as a constrained optimization problem as proposed in 
[5, 6]. Depending on the type of collisions we can change this constraint set (for 
example, inelastic collisions do not preserve energy). We focus here just on the 
case of elastic collisions, which preserve mass, momentum, and energy. 

Let M = N d be the total number of grid points, let Q = (Qi, . . . , Qm) T be 
the result of the spectral formulation from the previous section, written in vector 
form, and let uij be the quadrature weights over the domain in this ordering. 
Define the integration matrix 



where v l , i = 1, 2, 3 refers to the ith component of the velocity vector. Using this 
notation, the conservation method can be written as a constrained optimization 
problem. 

Find Q = (Q 1 ,...,Q M ) T that minimizes ^ || Q — Qlll such that CQ = (20) 
The solution is given by 



The overall cost of the conservation portion of the algorithm is a 0(N d ) 
matrix- vector multiply, significantly less than the computation of the weighted 
convolution. 

3.4 Spatial and Transport discretization 

For simplicity this will be presented in ID in space, though the ideas apply to 
higher dimensions. In this case the transport equation reduces to 




Q = Q + C(CC T ) : CQ 
= PnQ 



(21) 




t) = 0. 
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We partition the domain into cells of size Axj (not necessarily uniform) 
with cell centers Xj . Using a finite volume approach, we integrate the transport 
equation over a single cell to obtain 

/; +1 (^-;> t ) , f^zILhi n 

At + Axj 

where t n — nAt and -F™ tl / 2 i s an approximation of the edge fluxes v\f of the 
cell between time t n and t n+1 . We use a second order upwind scheme defined 
by 

J+1/2 \vi(f? + i-¥*? + i), otherwise, V ' 

where aj is a cell slope term used in the reconstruction defined by the minmod 
limiter. 

On wall boundaries the incoming flux is determined using ghost cells and 
the diffusive reflection formula (5). For problems without meaningful bound- 
ary interactions (e.g. shocks), a no- flux boundary condition is applied for the 
incoming characteristics. 



4 Parallelization 

The major bottleneck in parallelizing a program is memory access times related 
to communication between processes. However, the relative locality of the dy- 
namics of the Boltzmann equation allow for a straightforward decomposition of 
phase space. In each time step, a single grid point only "sees" the particles at 
the same spatial grid point, through the collision term, and particles with the 
same velocity at neighboring spatial grid points, through the transport term. 
As most of the computational time is spent on the collision term we choose to 
keep all of the information needed for this step on the same computational node, 
and thus partition the spatial grid points across the computational nodes. 

Further parallelization can be realized on each node by noting that the com- 
putation of Q(C) is simply a sum involving the precomputed weights G, the 
distribution function /, and (, and is completely independent of computation 
on another velocity grid point. We use OpenMP [10] to distribute the weighted 
convolutions to each computational core on the node. Each core computes the 
convolution for a single grid point in £, marching through until all of the convo- 
lutions on the node are complete. Because there is no memory transfer required 
between cores, this should speed up the computation of the collisional terms by 
a factor of p, where p is the total number of cores used in the computation. 

We use the Message Passing Interface (MPI) [4] to communicate distribu- 
tion function data between nodes using an interleaved ghost cells technique [7] . 
Each node maintains two spatial ghost cells on the left and right sides of its 
local spatial domain, which are filled by the distribution functions sent from the 
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neighboring node. Once this information is received, the node has enough in- 
formation to update the "regular" spatial grid points through the finite volume 
transport scheme described above. 

To make a rough estimate of the total speedup let n be the number of nodes 
used, and np be the number of cores used (assume a fixed number of cores per 
node). Then the speedup can be described by, to leading order, 

^SERIAL _ CN x N 6 Tpi,op 

Tparallel ~ 4nN 3 T M EM + CN x N 6 T FLOP /np 1 

where Tmem is the time to transfer a single floating point piece of data, and 
Tflop is the time for a single floating point operation. As n becomes large with 
N and N x fixed, more and more data transfer is required, however one would 
need ^v?pTmem /Tflop ~ CN X N 3 before the memory access time would begin 
to dominate the collisional computations. 



5 Numerical results 

We test this method with the sudden change in wall temperature example sug- 
gested by Aoki et al. [1]. In this problem the gas is initially at equilibrium and 
the temperature of the wall at the boundary of the domain is instantaneously 
changed at t = 0. This gives rise to a discontinuous distribution function at the 
wall, which propagates into the domain and eventually forms a shock on the 
interior of the domain. This problem is especially well suited to a deterministic 
method because the discontinuity near the wall results in a distribution func- 
tion that is far from equilibrium, which is difficult to simulate with Monte Carlo 
based solvers. 

In Figure 5 we show shock formation due to a sudden change in wall temper- 
ature. Unlike the computations in [I], only two grid points are used per mean 
free path in the interior of the domain. Near the wall, the grid is refined to 
eight points per mean free path to better capture the finer dynamics where the 
distribution function is discontinuous. In both examples, the scaled Knudsen 
number e — 1. Note that despite the discontinuity we do not observe any Gibbs 
phenomena in the solution. We hypothesize that this is due to the mixing effects 
of the convolution weights; this will be explored in future work. 

In Table 1 we show the wall time scaling results for the sudden heating 
example in Figure 1. This code was executed on the TACC supercomputer 
Lonestar, which has twelve cores per node. There is a jump in computational 
time when moving from a single node to two nodes, but every doubling of the 
node amount thereafter results in an almost exact halving of the computational 
time required with the previous number of nodes scowing near-perfect linear 
scaling for this range of nodes. 
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Figure 1: Formation of shock from suddent change in wall temperature. Evolu- 
tion of bulk velocity. 




nodes 


cores 


time (s) 


1 


12 


203 


2 


24 


235.3 


4 


48 


120.8 


8 


96 


61.4 


16 


192 


30.9 


32 


384 


15.2 


64 


768 


7.7 



Table 1: Computational time for a single timestep in sudden heating example 
from Figure 1. 
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6 Conclusions 



We have extended the spectral method of Gamba and Tharkabhushanam to a 
second order scheme with a nonuniform grid in physical space, and investigated 
its scaling to high performance computers. The method showed nearly linear 
speedup when applied to a large computation problem solved on a supercom- 
puter. However, at some point memory access will still become a problem, not 
with transfer between nodes but with memory access on a single node. The 
most expensive object to store is the six dimensional weight array G(£,£)> an d 
if TV becomes too large it may not fit on a single node's memory, significantly 
slowing down computation. At that point, it may be faster to simply compute 
the weights on the fly, as flops are much cheaper than memory accesses on the 
large distributed systems used in high performance computing today. Future 
advances in hardware such as Intel's new Many Integrated Core (MIC) nodes, 
which will be used in TACC's new Stampede computer starting in 2013, have 
many more cores and much more local memory than ever before. Future work 
will investigate scaling when pushing the bounds of both number of cores p of 
the system as well as the memory requirements on each node. 
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